+++
title = 'Models for sequential data'
template = 'page-math.html'
+++
# Models for sequential data

## Sequences
They consist of numbers or symbols:
- numeric: 1-dimensional, e.g. a stock price over time. Can be n-dimensional.
- symbolic (categorical): 1-dimensional, like an English sequence of words/characters. Can be n-dimensional, with multiple categorical features per timestamp (like sheet music).

We could have one sequence per instance and try to classify the sequences (like email spam/not spam),
or the whole dataset is one sequence and the instances are ordered.

Single-sequence feature extraction:
- make it a regression problem: each point is represented by the m values before it
- this gives us a table with a target label (the value at time t) and m features (the m preceding values)
- you could also use mean/variance statistics
- but be careful: if the data is shuffled, the classifier is trained on data that comes from the future (relative to the test data)

Major key: think about the real-world use case. E.g. if we want to predict future values, the training data shouldn't contain things that happen later than the test data.

You can do walk-forward validation, if the target labels have a meaningful ordering in time:

![519e53c90f0dae96942ac72ed59aacc0.png](9e5ba66c8e834fc383006a31ff012558.png)

When modelling probability, break the sequence into its tokens, like words in a sentence. Each token is modelled as a random variable (_not_ independent).
So you end up with a joint distribution P(W₄, W₃, W₂, W₁) (with some arbitrary number of parameters).

Can apply the chain rule of probability:

$\begin{aligned}
P(W_4, W_3, W_2, W_1) &= P(W_4, W_3, W_2 | W_1) P(W_1) \\
&= P(W_4, W_3 | W_2, W_1) P(W_2 | W_1) P(W_1) \\
&= P(W_4 | W_3, W_2, W_1) P(W_3 | W_2, W_1) P(W_2 | W_1) P(W_1)
\end{aligned}$

I.e., we can rewrite the probability of a sentence as a product of the probability of each word, conditioned on its history.
With log probability, you get a sum: $\log{P(\text{sentence})} = \sum_{\text{word}} \log{P(\text{word} | \text{words before it})}$

## Markov models
Markov assumption: limit the amount of memory for previous tokens, e.g. retain a maximum of 2 words.
The "order" is the number of words retained in the conditional.

For example, if the conditional is $P(x | \text{i, will, graduate, in, a, decade})$ and it's a third-order model, the Markov assumption truncates it to $P(x | \text{i, will, graduate})$ (only three words of history are kept).

With the Markov assumption and the chain rule, we can model a sequence as a product of limited-memory conditional probabilities. These can be estimated from a corpus (a huge piece of text).

For example, to estimate the probability of the word 'prize' given "won a", count how often "won a prize" occurs in the text as a proportion of the total occurrences of "won a":

$P(\text{prize} | \text{a, won}) \approx \frac{\text{\# won a prize}}{\text{\# won a}}$

The word snippets are "n-grams". Three words is a trigram, two words is a bigram. I guess one word is just a gram. And maybe 1000 words would be a kilogram.

Sequential sampling: start with a small seed of words, then repeatedly sample the next word according to its probability given the previous words.
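A minimal sketch of this counting and sampling in Python, using bigrams for brevity (conditioning on a single previous word; the same counting works for longer histories). The toy corpus and all names (`bigram_counts`, `prob`, etc.) are made up for illustration and are not part of the course material.

```python
import random
from collections import Counter, defaultdict

# Toy corpus; in practice this would be a huge tokenised text.
corpus = "we won a prize and we won a medal and we lost a game".split()

# Count how often each word follows a given previous word.
bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def prob(word, prev):
    """Estimate P(word | prev) as #(prev word) / #(prev ...)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(prob("prize", "a"))  # 1/3 here: "a" is followed by prize, medal, game

# Sequential sampling: start from a seed and repeatedly sample the next word
# according to its estimated probability given the previous word.
sequence = ["we"]
for _ in range(6):
    successors = bigram_counts[sequence[-1]]
    if not successors:
        break  # the previous word never had a successor in the corpus
    sequence.append(random.choices(list(successors), weights=successors.values())[0])
print(" ".join(sequence))
```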
## Embedding models
Model an object x by an embedding vector e<sub>x</sub>. The similarities of these vectors represent similarities between words.
This creates embedding vectors for words, where distances and directions reflect semantic meaning.

Distributional hypothesis: words that occur in the same context often have similar meanings.

One-hot vector: represent words as atomic objects, each word being one element in a long vector (a 1 at its own index, 0 everywhere else).

Word2Vec:
- slide a context window over the sequence, trying to predict the distribution P(y|x): which words are likely to occur in the context window given the middle word
- create a dataset of word pairs from the text
- feed this dataset to a two-layer network, which predicts the context
- a softmax activation over 10k outputs is expensive, so some tricks are needed to make it feasible
- after training, discard the second layer (the softmax) and only use the embeddings produced by the first layer

## Recurrent neural networks
A neural network with cycles in it (used for sequences).

Can be used for:
- sequence to sequence, e.g. translating English to French
- sequence to label, e.g. sequence classification
- label to sequence, e.g. sentence generation

Example: a fully connected network with the input x extended by three nodes, to which the hidden layer is copied:

![128d6b585d9c1bdaf5a568bc022d8763.png](761f27ac323e4070bf7df80a540a6831.png)

Visual shorthand:
- a rectangle is a vector of nodes
- an arrow feeding into a rectangle, annotated with a weight matrix, means a fully connected transformation
- if a line doesn't have a weight, it's a copy of the input vector
- if two lines flow into each other, concatenate their vectors

![d32563c1b461ded32d46894a7de2198d.png](44c66910d5c84b08a5a833c3633f9595.png) ![afb898aeecf247a602f622999280ed61.png](92011008806443b784e5f7bc2d3b0a77.png)

Training RNNs:
- provide an input sequence x and a target sequence t
- backpropagation through time:
  - unroll:
    - every step in the sequence is applied in parallel to a copy of the network
    - the recurrent connection flows from the previous copy to the next
    - the whole thing is a feedforward net (a network without cycles)
  - the hidden layer is initialised to the zero vector

![5a22d7340c41f70a48e9815f7576ec3a.png](82d2db6b48d4409582ac544450821b12.png)

(A minimal sketch of this unrolled forward computation is at the end of these notes.)

Basic RNNs work well, but they don't learn to remember information for a long time.
You can't have a long-term memory for everything; you need to be selective. In order to remember things long term, you need to forget a lot of other stuff (such is life).

## LSTMs
"Long short-term memory".
Selective forgetting and remembering, controlled by learnable "gates". Side note, from now on I'm not "studying", but I'm "selectively forgetting and remembering".

*[tanh]: sigmoid rescaled so its outputs are between -1 and 1

The gating mechanism takes two input vectors, which are combined with sigmoid and tanh activations.
It produces an additive value: we want to figure out how much of the input to add to some other vector.
The tanh maps the input into the range (-1, 1), which limits the effect of the addition vector.
The sigmoid acts as a selection vector.

![c60c95b58a2975c98df7c548b0585b9b.png](e8795f123d904061a5f2ae90dfdc2c4e.png)

The basic operation of the LSTM is a "cell". There are two recurrent connections between cells: the current output y, and the cell state C. (A sketch of one cell step is at the end of these notes.)

I don't yet know how much detail we need to know about this, so I'll fill it in later based on exam questions.

The prof's summary: "incredibly powerful language models. Tricky to train, very opaque." Yep, opaque and complicated, indeed.
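To make the unrolled RNN forward computation concrete, here is a minimal numpy sketch of a basic (Elman-style) RNN applied step by step to a sequence. The dimensions, the weight names `U`, `W`, `V`, and the random data are assumptions for illustration only; biases and the backward pass are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: input size, hidden size, output size, sequence length.
n_in, n_hid, n_out, T = 4, 8, 3, 5

# Weights: input-to-hidden (U), hidden-to-hidden (W, the recurrent connection),
# hidden-to-output (V).
U = rng.normal(scale=0.1, size=(n_hid, n_in))
W = rng.normal(scale=0.1, size=(n_hid, n_hid))
V = rng.normal(scale=0.1, size=(n_out, n_hid))

x = rng.normal(size=(T, n_in))   # input sequence
h = np.zeros(n_hid)              # hidden layer initialised to the zero vector

ys = []
for t in range(T):
    # Each step applies a copy of the same network; the recurrent connection
    # carries the previous hidden state into the current copy.
    h = np.tanh(U @ x[t] + W @ h)
    ys.append(V @ h)

print(np.array(ys).shape)  # (T, n_out): one output per time step
```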
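And a sketch of a single LSTM cell step, showing how the sigmoid "selection" vectors and the tanh-squashed candidate combine additively with the cell state C. I'm assuming the standard LSTM formulation (forget, input, and output gates); the names and shapes are illustrative, not necessarily the exact variant from the lectures, and biases are again omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM cell step: returns the new output y (= h) and cell state C."""
    Wf, Wi, Wo, Wc = params  # one weight matrix per gate / candidate

    # The two incoming lines (previous output and current input) are concatenated.
    z = np.concatenate([h_prev, x])

    f = sigmoid(Wf @ z)        # forget gate: how much of the old cell state to keep
    i = sigmoid(Wi @ z)        # input gate: how much of the candidate to add
    o = sigmoid(Wo @ z)        # output gate: how much of the cell state to expose
    c_tilde = np.tanh(Wc @ z)  # candidate values, squashed into (-1, 1)

    C = f * C_prev + i * c_tilde   # selective forgetting + selective remembering
    h = o * np.tanh(C)             # current output y
    return h, C

# Toy usage with random weights.
rng = np.random.default_rng(1)
n_in, n_hid = 4, 8
params = [rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(4)]

h = np.zeros(n_hid)
C = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):   # a short input sequence
    h, C = lstm_step(x, h, C, params)
print(h.shape, C.shape)  # (8,) (8,)
```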